Optimizing ETL Dataflow Using Shared Caching and Parallelization Methods

Author

  • Xiufeng Liu
Abstract

Extract-Transform-Load (ETL) handles large amounts of data and manages workloads through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. To minimize the time and resources that ETL dataflows require, this paper presents a framework that optimizes dataflows using shared caching and parallelization techniques. The framework classifies the components in an ETL dataflow into categories based on their data operation properties, then partitions the dataflow at different granularities according to that classification. It then applies optimization techniques such as cache reuse, pipelining, and multithreading to the partitioned dataflows. The proposed techniques reduce the system memory footprint and the frequency of copying data between components, and they take full advantage of the computing power of multi-core processors. The experimental results show that the proposed optimization framework is 4.7 times faster than ordinary ETL dataflows (those that do not use the proposed optimization techniques) and that it outperforms a similar tool (Kettle).
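The combination of cache reuse and pipelining described in the abstract can be illustrated with a small sketch. This is not the paper's framework: the names `run_stage`, `run_pipeline`, `add_total`, and `tag_large`, the batch layout, and the queue size are all hypothetical, and the sketch assumes every component processes rows one at a time and can therefore update a shared batch in place instead of copying it to the next component.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def run_stage(transform, inbox, outbox):
    """Apply `transform` in place to every row of each batch, then hand the
    same batch object (the shared cache) to the next stage without copying."""
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream downstream
            break
        for row in batch:
            transform(row)  # mutates the shared row
        outbox.put(batch)


def run_pipeline(batches, transforms):
    """Run the transforms as a pipeline with one thread per stage."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(transforms) + 1)]

    def feed():
        # A separate feeder thread avoids blocking on the bounded queues.
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(SENTINEL)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=run_stage, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for th in threads:
        th.start()

    results = []
    while True:
        batch = queues[-1].get()
        if batch is SENTINEL:
            break
        results.extend(batch)
    for th in threads:
        th.join()
    return results


# Hypothetical row-by-row components.
def add_total(row):
    row["total"] = row["price"] * row["qty"]


def tag_large(row):
    row["large"] = row["total"] >= 100


batches = [
    [{"price": 10, "qty": 5}, {"price": 30, "qty": 4}],
    [{"price": 2, "qty": 3}],
]
rows = run_pipeline(batches, [add_total, tag_large])
```

Each stage works on a different batch at the same time, and the rows that come out are the same objects that went in, so no data is copied between components. In standard CPython the global interpreter lock limits CPU-bound threads to one core, so this sketch shows the structure of the approach rather than the multi-core speedup the abstract reports.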

Similar resources

BMDFM: A Hybrid Dataflow Runtime Parallelization Environment for Shared Memory Multiprocessors (Technische Universität München, Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation)

Nowadays parallel shared memory symmetric multiprocessors (SMP) are complex machines, where a large number of architectural aspects have to be simultaneously addressed in order to achieve high performance. The quick evolution of parallel machines has been followed by the evolution of parallel execution environments. An effective parallel environment must be high-level enough so that it is easy ...

BMDFM: a hybrid dataflow runtime parallelization environment for shared memory multiprocessors

BMDFM and the performance obtained running both standard numerical applications and non-trivial adaptive algorithm based applications.

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods a...

Optimizing and parallelizing X-ray spectral analysis software for shared memory systems

This thesis discusses the optimization and parallelization of an existing program written in Fortran 90. After profiling the program with different use cases, two parts that will be optimized are selected. One of these is essentially a matrix-vector multiplication; the other calculates absorption of photons due to photoionization. Parallelization will be done using OpenMP, since the program mus...

CTL: A Platform-Independent Crypto Tools Library Based on Dataflow Programming Paradigm (Extended Edition)

The diversity of computing platforms is increasing rapidly. In order to allow security applications to run on such diverse platforms, implementing and optimizing the same cryptographic primitives for multiple target platforms and heterogeneous systems can result in high costs. In this paper, we report our efforts in developing and benchmarking a platform-independent Crypto Tools Library (CTL). ...

Journal title:
  • CoRR

Volume abs/1409.1639

Publication date 2014